Template author: Morgan Carr-Markell
Template last modified on: Feb 28, 2023
Notebook modified by: Katie Wang
Notebook last modified on: Mar 7, 2023
The code and many descriptions in this notebook come from a tutorial posted on the National Ecological Observatory Network (NEON) website and originally from Data Carpentry. You can find the original tutorial here: https://www.neonscience.org/resources/learning-hub/tutorials/introduction-working-raster-data-r#toggle-26
Original Authors: Leah A. Wasser, Megan A. Jones, Zack Brym, Kristina Riemer, Jason Williams, Jeff Hollister, Mike Smorul
Original Last Updated: Apr 8, 2021
We often want to combine values of and perform calculations on rasters to create a new output raster. This tutorial covers how to subtract one raster from another using basic raster math and the overlay() function.
# load libraries
options("rgdal_show_exportToProj4_warnings"="none") # to suppress warning messages
library(raster)
Loading required package: sp
library(rgdal)
Please note that rgdal will be retired during 2023,
plan transition to sf/stars/terra functions using GDAL and PROJ
at your earliest convenience.
See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
rgdal: version: 1.6-4, (SVN revision 1196)
Geospatial Data Abstraction Library extensions to R successfully loaded
Loaded GDAL runtime: GDAL 3.5.2, released 2022/09/02
Path to GDAL shared files: C:/Users/achro/AppData/Local/R/win-library/4.2/rgdal/gdal
GDAL binary built with GEOS: TRUE
Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
Path to PROJ shared files: C:/Users/achro/AppData/Local/R/win-library/4.2/rgdal/proj
PROJ CDN enabled: FALSE
Linking to sp version:1.6-0
To mute warnings of possible GDAL/OSR exportToProj4() degradation,
use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
library(rasterVis)
Loading required package: lattice
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method         from 
print.tbl_lazy dplyr
print.tbl_sql  dplyr
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.0.10
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract() masks raster::extract()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::select() masks raster::select()
# load the DTM & DSM rasters
DTM_HARV <- raster("NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop.tif")
DSM_HARV <- raster("NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
For example, if we are interested in mapping the heights of trees across an entire field site, we might want to calculate the difference between the Digital Surface Model (DSM, tops of trees) and the Digital Terrain Model (DTM, ground level). The resulting dataset is referred to as a Canopy Height Model (CHM) and represents the actual height of trees, buildings, etc. with the influence of ground elevation removed.
Note: This is an example of local map algebra!
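As a toy illustration (plain R matrices standing in for rasters here, not part of the original tutorial), "local" map algebra means each output cell depends only on the corresponding cells of the inputs:

```r
# Toy example: small matrices standing in for the DSM and DTM rasters.
# Local map algebra: each CHM cell is simply DSM - DTM at that cell.
dsm <- matrix(c(320, 325, 318, 330), nrow = 2)  # surface heights (m): ground + canopy
dtm <- matrix(c(300, 305, 300, 310), nrow = 2)  # bare-earth heights (m)
chm <- dsm - dtm                                # canopy heights (m)
chm
#      [,1] [,2]
# [1,]   20   18
# [2,]   20   20
```

The subtraction on real RasterLayer objects below works the same way, cell by cell, with the added bookkeeping of extent, resolution, and CRS.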
We can calculate the difference between two rasters in two different ways: directly, by subtracting the two rasters using raster math, or, for more efficient processing (particularly if our rasters are large and/or the calculations we are performing are complex), by using the overlay() function.
We can perform raster calculations by simply subtracting (or adding, multiplying, etc.) two rasters. In the geospatial world, we call this “raster math”.
Let’s subtract the DTM from the DSM to create a Canopy Height Model.
# Raster math example
CHM_HARV <- DSM_HARV - DTM_HARV
# plot the output CHM
plot(CHM_HARV,
main="Canopy Height Model - Raster Math Subtract\n NEON Harvard Forest Field Site",
axes=FALSE)
Let’s have a look at the distribution of values in our newly created Canopy Height Model (CHM).
# histogram of CHM_HARV
hist(CHM_HARV,
col = "springgreen4",
main = "Histogram of Canopy Height Model\nNEON Harvard Forest Field Site",
ylab = "Number of Pixels",
xlab = "Tree Height (m)")
(3 pts) What is the range of all the tree heights?
The tree heights range from about 0 to 35 m.
(3 pts) Briefly describe the distribution (skewed, bimodal, symmetrical)?
The distribution is skewed left and has two modes: one at 18-20 m and one at 0-2 m.
Note: I am skipping the overlay() function section. We will learn about this in future weeks.
Now that we’ve created a new raster, let’s export the data as a GeoTIFF using the writeRaster() function.
When we write this raster object to a GeoTIFF file we’ll name it chm_HARV.tiff. This name allows us to quickly remember both what the data contains (CHM data) and for where (HARVard Forest). The writeRaster() function by default writes the output file to your working directory unless you specify a full file path.
# create an output subdirectory if it doesn't yet exist
if (!dir.exists("output")){
dir.create("output")
}
# export CHM object to new GeoTIFF
writeRaster(CHM_HARV,
"output/chm_HARV.tiff",
format = "GTiff", # specify output format - GeoTIFF
overwrite = TRUE, # CAUTION: if this is true, it will overwrite an existing file
NAflag = -9999) # set no data value to -9999
The function arguments that we used above include:
format: the output file format, here GeoTIFF ("GTiff")
overwrite: if TRUE, an existing file with the same name will be overwritten
NAflag: the value used to represent missing (NoData) cells, here -9999
Data are often more interesting and powerful when we compare them across various locations. Let’s compare some data collected over Harvard Forest to data collected in Southern California. The NEON San Joaquin Experimental Range (SJER) field site located in Southern California has a very different ecosystem and climate than the NEON Harvard Forest Field Site in Massachusetts.
DTM_SJER <- raster("NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmCrop.tif")
DSM_SJER <- raster("NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif")
crs(DTM_SJER)
Coordinate Reference System:
Deprecated Proj.4 representation:
+proj=utm +zone=11 +datum=WGS84 +units=m +no_defs
WKT2 2019 representation:
PROJCRS["unknown",
BASEGEOGCRS["unknown",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ID["EPSG",6326]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8901]]],
CONVERSION["UTM zone 11N",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-117,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",500000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]],
ID["EPSG",16011]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]]
crs(DSM_SJER)
Coordinate Reference System:
Deprecated Proj.4 representation:
+proj=utm +zone=11 +datum=WGS84 +units=m +no_defs
WKT2 2019 representation:
PROJCRS["unknown",
BASEGEOGCRS["unknown",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ID["EPSG",6326]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8901]]],
CONVERSION["UTM zone 11N",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-117,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",500000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]],
ID["EPSG",16011]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]]
hist(DTM_SJER,
xlab = "Terrain height (m)",
ylab = "Number of Pixels")
Warning: 2% of the raster cells were used. 100000 values used.
hist(DSM_SJER,
xlab = "Surface height (m)",
ylab = "Number of Pixels")
Warning: 2% of the raster cells were used. 100000 values used.
(3 pts) Are there obvious outliers?
Based on what the histograms show, there are no obvious outliers.
(3 pts) Create a canopy height model (CHM) for SJER. Create a CHM from the two raster layers and check to make sure the data are what you expect.
CHM_SJER <- DSM_SJER - DTM_SJER
# checking data by making a histogram
hist(CHM_SJER)
plot(CHM_SJER,
main="Canopy Height Model - Raster Math Subtract\n NEON San Joaquin Experimental Range Field Site",
axes = FALSE)
# code copied from above
plot(CHM_HARV,
main="Canopy Height Model - Raster Math Subtract\n NEON Harvard Forest Field Site",
axes=FALSE)
(4 pts) Compare the vegetation structure of the Harvard Forest and San Joaquin Experimental Range. What is different about the distributions of canopy heights at the two NEON sites?
At the San Joaquin Experimental Range, the trees are for the most part very low-lying, with only a few tall individuals. There’s greater variation in canopy height at the Harvard Forest, and in general the trees are taller there.
(3 pts) Export the SJER CHM as a GeoTIFF in the output subfolder (like we did above for CHM_HARV, but save it as “output/chm_SJER.tiff”):
writeRaster(CHM_SJER,
"output/chm_SJER.tiff",
format = "GTiff", # GeoTIFF output
overwrite = TRUE, # will overwrite any existing file
NAflag = -9999) # no data value is -9999
Optional: Originally, I had thought we would also do Raster 07, but it’s too much for one homework assignment, so the rest is just for you to look through if you are interested:
Note: We are skipping Raster 04-06, but if you want to learn more about time series raster data and working with bands you can find those tutorials here
Original Authors: Leah A. Wasser, Megan A. Jones, Zack Brym, Kristina Riemer, Jason Williams, Jeff Hollister, Mike Smorul
Original Last Updated: Apr 8, 2021
In this tutorial, we will extract NDVI values from a raster time series dataset in R and plot them using ggplot.
If you want to work through this tutorial (which, again, is completely optional), you’ll need to first:
In science, we often want to extract summary values from raster data. For example, we might want to understand overall greeness across a field site or at each plot within a field site. These values can then be compared between different field sites and combined with other related metrics to support modeling and further analysis.
The imagery data used to create this raster teaching data subset were collected over the National Ecological Observatory Network’s Harvard Forest and San Joaquin Experimental Range field sites.
The imagery was created by the U.S. Geological Survey (USGS) using a multispectral scanner on a Landsat Satellite. The data files are Geographic Tagged Image-File Format (GeoTIFF).
Let’s load the data and create a rasterStack first.
# Create list of NDVI file paths
all_HARV_NDVI <- list.files("NEON-DS-Landsat-NDVI/HARV/2011/NDVI",
full.names = TRUE,
pattern = ".tif$")
# Create a time series raster stack
NDVI_HARV_stack <- stack(all_HARV_NDVI)
# apply scale factor
NDVI_HARV_stack <- NDVI_HARV_stack / 10000
Our goal in this tutorial is to create a data.frame that contains a single, mean NDVI value for each raster in our time series. This value represents the mean NDVI value for this area on a given day.
We can calculate the mean for each raster using the cellStats function. The cellStats function produces a numeric array of values. We can then convert our array format output to a data.frame using as.data.frame().
avg_NDVI_HARV <- NDVI_HARV_stack %>%
cellStats(mean) %>% # calculate mean NDVI for each raster
as.data.frame() # convert output array to data.frame
# view data
avg_NDVI_HARV
We now have a data.frame with row.names based on the original file name and a mean NDVI value for each file. Next, let’s clean up the column names in our data.frame to make it easier for colleagues to work with our code.
The column name is not ideal. Let’s change the NDVI column name to meanNDVI.
# view column name slot
names(avg_NDVI_HARV)
# rename the NDVI column
names(avg_NDVI_HARV) <- "meanNDVI"
# view cleaned column names
names(avg_NDVI_HARV)
While we are only working with one site now, we might want to compare several sites worth of data in the future. Let’s add a column to our data.frame called “site”. We can populate this column with the site name - HARV. Let’s also create a year column and populate it with 2011 - the year our data were collected.
avg_NDVI_HARV <- avg_NDVI_HARV %>%
mutate(site = "HARV") %>% # add a site column to our data
mutate(year = "2011") # add a "year" column to our data
# view data
head(avg_NDVI_HARV)
We now have a data frame that contains a row for each raster file processed and columns for meanNDVI, site, and year.
We’d like to produce a plot where Julian days (the numeric day of the year, 0 - 365/366) is on the x-axis and NDVI is on the y-axis. To create this plot, we’ll need a column that contains the Julian day value.
One way to create a Julian day column is to use gsub() on the file name in each row. We can remove both the X and the _HARV_ndvi_crop parts to extract the Julian day value:
X005_HARV_ndvi_crop
# note that the vertical bar character ( | ) is equivalent to "or". This
# allows us to search for more than one pattern in our text strings.
julianDays <- gsub(pattern = "X|_HARV_ndvi_crop", #the pattern to find
x = row.names(avg_NDVI_HARV), #the object containing the strings
replacement = "") #what to replace each instance of the pattern with
# make sure output looks ok
head(julianDays)
And now we can add julianDay values as a column in the data frame:
avg_NDVI_HARV$julianDay <- julianDays
class(avg_NDVI_HARV$julianDay)
Storing this data as a date object would be better - for plotting, data subsetting and working with our data. Let’s convert.
To convert a Julian Day number to a date class, we need to set the origin: the day on which “counting” Julian Days began. Our data are from 2011, and we know that the USGS Landsat Team created Julian Day values for this year. Therefore, the first day or “origin” for our Julian day count is 01 January 2011. Once we set the Julian Day origin, we can add the Julian Day value (as an integer) to the origin date.
Since the origin date was originally set as a Date class object, the new Date column is also stored as class Date.
# set the origin for the julian date (1 Jan 2011)
origin <- as.Date("2011-01-01")
# convert "julianDay" from class character to integer
avg_NDVI_HARV$julianDay <- as.integer(avg_NDVI_HARV$julianDay)
# create a date column; -1 added because origin is the 1st.
# If we didn't subtract 1: 01/01/2011 + 5 = 01/06/2011 which is Julian day 6, not 5.
avg_NDVI_HARV$Date <- origin + (avg_NDVI_HARV$julianDay - 1)
# did it work?
head(avg_NDVI_HARV$Date)
class(avg_NDVI_HARV$Date)
class(avg_NDVI_HARV$julianDay)
Note that when we converted our integer julianDay values to dates, we subtracted 1: avg_NDVI_HARV$Date <- origin + (avg_NDVI_HARV$julianDay - 1). This is because the origin day, 01 January 2011, is itself Julian day 1. If we simply computed origin + julianDay, a julianDay of 5 would give 01 January + 5 = 06 January 2011, which is Julian day 6, not 5. Subtracting 1 corrects this, giving 05 January 2011.
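A quick standalone check of this off-by-one, using a toy julianDay value of 5 rather than the real data:

```r
# Julian day 5 should map to January 5th; without the -1 we land on the 6th.
origin <- as.Date("2011-01-01")
with_correction <- origin + (5 - 1)   # 2011-01-05
without_correction <- origin + 5      # 2011-01-06 -- off by one
format(with_correction)
format(without_correction)
```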
We now have a clean data.frame with properly scaled NDVI and Julian days. Let’s plot our data.
We will use the ggplot() function within the ggplot2 package for this plot.
# plot NDVI
ggplot(avg_NDVI_HARV, aes(julianDay, meanNDVI)) +
geom_point(size=4, color = "PeachPuff4") +
labs(title = "Landsat Derived NDVI - 2011\n NEON Harvard Forest Field Site",
x = "Julian Days",
y = "Mean NDVI") +
theme(text = element_text(size=20))
Create a complementary plot for the SJER data. Plot the data points in a different color.
TODO
TODO
TODO
TODO
TODO
As we look at these plots we see variation in greenness across the year. However, the pattern is interrupted by a few points where NDVI quickly drops towards 0 during a time period when we might expect the vegetation to have a larger greenness value. Is the vegetation truly senescent or gone or are these outlier values that should be removed from the data?
Let’s look at the RGB images from Harvard Forest.
NOTE: the code below uses loops, which we will not teach in this tutorial. However, the code demonstrates one way to plot multiple RGB rasters in a grid.
Note: This may take a little while to run
# open up RGB imagery
rgb.allCropped <- list.files("NEON-DS-Landsat-NDVI/HARV/2011/RGB/",
full.names=TRUE,
pattern = ".tif$")
# create a layout
par(mfrow=c(4,4))
# super efficient code to plot RGB image for each day
for (aFile in rgb.allCropped){
NDVI.rastStack <- stack(aFile)
plotRGB(NDVI.rastStack, stretch = "lin")
}
# reset layout
par(mfrow=c(1,1))
Which days (in terms of 1st, 2nd, last, 2nd-to-last) had very heavy cloud cover? These are arranged left to right, top to bottom.
Do you think it makes sense to include NDVI values from those days in an analysis? Why or why not?
Let’s look at the SJER site:
# open up the cropped files
rgb.allCropped.SJER <- list.files("NEON-DS-Landsat-NDVI/SJER/2011/RGB/",
full.names=TRUE,
pattern = ".tif$")
# create a layout
par(mfrow=c(5,4))
# Super efficient code
# note that there is an issue with one of the rasters
# NEON-DS-Landsat-NDVI/SJER/2011/RGB/254_SJER_landRGB.tif has a blue band with no range
# thus you can't apply a stretch to it. The code below skips the stretch for
# that one image. You could automate this by testing the range of each band in each image
for (aFile in rgb.allCropped.SJER){
NDVI.rastStack <- stack(aFile)
if (aFile == "NEON-DS-Landsat-NDVI/SJER/2011/RGB//254_SJER_landRGB.tif"){
plotRGB(NDVI.rastStack)
} else {
plotRGB(NDVI.rastStack, stretch="lin")
}
}
If we want to only retain points that we think are valid NDVI values, one way to do this is by identifying a threshold value. All values below that threshold will be removed from our analysis. We will use 0.1 as an example for this tutorial. We can then use the subset function to remove outlier datapoints (below our identified threshold).
avg_NDVI_HARV_clean <- avg_NDVI_HARV %>%
subset(meanNDVI > 0.1)
# Did it work?
avg_NDVI_HARV_clean$meanNDVI < 0.1
avg_NDVI_SJER_clean <- TODO
Now let’s combine the cleaned data frames using rbind:
avg_NDVI_clean <- rbind(avg_NDVI_HARV_clean, avg_NDVI_SJER_clean)
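rbind() simply stacks the rows of data frames that share the same columns. A minimal sketch with made-up values (not the real NDVI data) shows the idea:

```r
# Two small data frames with identical columns, standing in for the cleaned
# HARV and SJER tables; rbind() stacks their rows into one data frame.
harv <- data.frame(meanNDVI = c(0.35, 0.72), site = "HARV")
sjer <- data.frame(meanNDVI = c(0.41, 0.28), site = "SJER")
both <- rbind(harv, sjer)
nrow(both)         # 4
unique(both$site)  # "HARV" "SJER"
```

Because both data frames carry a site column, the combined data frame can later be grouped or colored by site when plotting.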
TODO
In which months is NDVI highest and lowest at Harvard Forest? What about San Joaquin Experimental Range?
What do you think might drive the seasonal patterns of NDVI at the two sites?